[SPARK-38094] Enable matching schema column names by field ids #35385

Closed
wants to merge 14 commits into from

Contributor

@jackierwzhang jackierwzhang commented Feb 3, 2022

What changes were proposed in this pull request?

Field Id is a native field in the Parquet schema (https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L398)

After this PR, when the requested schema has field IDs, Parquet readers will first use the field ID to determine which Parquet columns to read if the field ID exists in Spark schema, before falling back to match using column names.
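The lookup order described above can be sketched in a few lines of plain Scala. This is only an illustration of the rule as stated in this description, not the code in this PR: `RequestedField`, `FileColumn`, and `resolve` are hypothetical names, only top-level fields are considered, and name matching is simplified to an exact string comparison.

```scala
// Simplified stand-ins for illustration; these are not Spark or Parquet classes.
object FieldIdMatchingSketch {
  // A field in the schema requested by the reader, with an optional field ID.
  final case class RequestedField(name: String, fieldId: Option[Int])
  // A column found in the Parquet file, with an optional field ID.
  final case class FileColumn(name: String, fieldId: Option[Int])

  /** Resolve one requested field against the columns of a Parquet file. */
  def resolve(requested: RequestedField, columns: Seq[FileColumn]): Option[FileColumn] =
    requested.fieldId match {
      // The requested field carries an ID: match on the ID and ignore names.
      case Some(id) => columns.find(_.fieldId.contains(id))
      // No ID on the requested field: fall back to matching by column name.
      case None => columns.find(_.name == requested.name)
    }
}
```

Under this sketch a column that was renamed in the requested schema is still found through its ID, while a requested field without an ID behaves exactly as it does with name matching.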

This PR supports:

  • Vectorized reader
  • parquet-mr reader

Why are the changes needed?

It enables matching columns by field id for supported DWs such as Iceberg and Delta. Specifically, it enables easy conversion from Iceberg (which uses field ids by name) to Delta, and allows id mode for Delta column mapping.

Does this PR introduce any user-facing change?

This PR introduces three new configurations:

spark.sql.parquet.fieldId.write.enabled: If enabled, Spark writes the native field ids stored in a StructField's metadata under the key parquet.field.id to Parquet files. This configuration defaults to true.

spark.sql.parquet.fieldId.read.enabled: If enabled, Spark attempts to read field ids in Parquet files and uses them to match columns. This configuration defaults to false, so Spark keeps its existing behavior by default.

spark.sql.parquet.fieldId.read.ignoreMissing: If enabled, Spark reads Parquet files that do not have any field ids even when the Spark schema asks for matching by id; nulls are returned for Spark columns without a match. This configuration defaults to false, so Spark alerts the user when field id matching is expected but the Parquet files do not have any ids.
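How the two read flags interact can be modeled roughly as follows. This is a sketch of the behavior described above, under stated assumptions: `Field`, `planRead`, and the error message are hypothetical, the model only covers top-level fields, and it does not call any Spark API.

```scala
// A plain-Scala model of the read-side flags; not Spark code.
object FieldIdReadConfigSketch {
  final case class Field(name: String, fieldId: Option[Int])

  /**
   * For each requested field, returns the name of the matched file column
   * (None stands for a column that would be read as all nulls),
   * or Left(message) when the read would be rejected.
   */
  def planRead(
      requested: Seq[Field],
      fileColumns: Seq[Field],
      readEnabled: Boolean,
      ignoreMissing: Boolean): Either[String, Seq[Option[String]]] = {
    def byName(f: Field): Option[String] =
      fileColumns.find(_.name == f.name).map(_.name)

    if (!readEnabled) {
      // fieldId.read.enabled = false: ids are ignored, names only.
      Right(requested.map(byName))
    } else {
      val idsRequested = requested.exists(_.fieldId.isDefined)
      val fileHasIds = fileColumns.exists(_.fieldId.isDefined)
      if (idsRequested && !fileHasIds && !ignoreMissing) {
        // fieldId.read.ignoreMissing = false: alert the user (hypothetical message).
        Left("requested schema uses field ids but the Parquet file has none")
      } else {
        Right(requested.map { f =>
          f.fieldId match {
            case Some(id) => fileColumns.find(_.fieldId.contains(id)).map(_.name)
            case None => byName(f)
          }
        })
      }
    }
  }
}
```

With `ignoreMissing` set to true, the same id-less file is accepted and every id-bearing requested field resolves to `None`, which corresponds to the null columns mentioned above.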

How was this patch tested?

Existing tests + new unit tests.

@github-actions github-actions bot added the SQL label Feb 3, 2022
@AmplabJenkins

Can one of the admins verify this patch?

Contributor

@sadikovi sadikovi left a comment

Thanks for opening a PR. I left a few comments and would appreciate it if you could address them.

val PARQUET_FIELD_ID_ENABLED =
buildConf("spark.sql.parquet.fieldId.enabled")
.doc("Field ID is a native field of the Parquet schema spec. When enabled, Parquet readers" +
" will use field IDs (if present) in the requested Spark schema to look up Parquet" +
Contributor

How does it work when there is a mixture of columns that have field id set and ones that don't?

Contributor Author

@jackierwzhang jackierwzhang Feb 3, 2022

It would try to match by id if an id exists; otherwise, it would fall back to matching by name.

Contributor

My understanding is that the code would use field ids if the flag is enabled; if the flag is disabled, the code would use names instead. My main concern is ambiguity resolution in the schema.

Contributor Author

@jackierwzhang jackierwzhang Feb 8, 2022

Ah, I meant that even when this flag is enabled, my statement above still applies: the matching is on a best-effort basis.

Disabling this flag will completely avoid reading and writing field ids.

@huaxingao
Contributor

does not support:

  • Parquet-mr reader due to lack of field id support (needs a follow up ticket)

Just for my own knowledge: what needs to be done to make parquet-mr support field id?

@jackierwzhang
Contributor Author

does not support:

  • Parquet-mr reader due to lack of field id support (needs a follow up ticket)

Just for my own knowledge: what needs to be done to make parquet-mr support field id?

I am still investigating. Previously I thought it required support from parquet-mr, but now it looks like that is not necessary.

I am working on a fix locally, which might be pushed out as part of this PR or another.

@huaxingao
Contributor

@jackierwzhang
FYI: I am working with @shangxinli on column id resolution in parquet-mr link, with pretty much the same motivation as yours. The work will probably overlap with yours.
One thing that I just realized is that field ids are not necessarily unique within a schema. For example:

message ParquetSchema {
  required group reqMap (MAP) = 1 {
    repeated group key_value (MAP_KEY_VALUE) {
      required binary key (STRING);
      optional group value (MAP) {
        repeated group key_value (MAP_KEY_VALUE) {
          required binary key (STRING);
          optional group value {
            required binary name (STRING) = 1;
            optional binary age (STRING) = 2;
            optional binary gender (STRING) = 3;
            optional group addedStruct = 4 {
              required binary name (STRING) = 1;
              optional binary age (STRING) = 2;
              optional binary gender (STRING) = 3;
            }
          }
        }
      }
    }
  }
}

I probably need to change the format specification to make the field id unique.

@jackierwzhang
Contributor Author

jackierwzhang commented Feb 7, 2022

@huaxingao

Got it.

As for duplicated field ids, I think in my approach, reading Parquet files with duplicated ids across different groups is allowed; essentially we just don't want confusion when matching fields which are on the same level in the schema.

Btw just curious, since you have been working on field id resolution for parquet-mr, do you know whether it currently supports reading and writing field ids yet?
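The same-level rule mentioned in this comment can be illustrated with a small check over a schema tree. This is a minimal sketch, not the validation code in this PR: `Node`, `duplicatesAmongSiblings`, and `isUnambiguous` are hypothetical names, and the tree is a plain case class rather than a Parquet or Spark type.

```scala
// Illustration only: ids may repeat across groups, but not among siblings.
object DuplicateFieldIdSketch {
  // Hypothetical schema node: a named field, an optional id, and child fields.
  final case class Node(name: String, fieldId: Option[Int], children: Seq[Node] = Nil)

  /** Ids that occur more than once among the direct children of one group. */
  def duplicatesAmongSiblings(group: Node): Set[Int] =
    group.children
      .flatMap(_.fieldId)
      .groupBy(identity)
      .collect { case (id, occurrences) if occurrences.size > 1 => id }
      .toSet

  /** True when no group anywhere in the tree has two children sharing an id. */
  def isUnambiguous(root: Node): Boolean =
    duplicatesAmongSiblings(root).isEmpty && root.children.forall(isUnambiguous)
}
```

Applied to the innermost group of the example schema above, where `name`, `age`, and `gender` reuse ids 1 to 3 inside `addedStruct`, this check passes because the repeated ids live in different groups.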

@huaxingao
Contributor

I think in my approach, reading Parquet files with duplicated ids across different groups is allowed; essentially we just don't want confusion when matching fields which are on the same level in the schema.

Sounds reasonable. I hope I can do the same too, but it seems to me that I need to resolve the column by id only, which requires the id to be unique in the entire schema. This is going to be a breaking change. Not sure if I am allowed to do it or not.

It doesn't seem to me that parquet-mr supports reading and writing field ids yet. The field ids are not in ColumnDescriptor.

@huaxingao
Contributor

@jackierwzhang
No, those are set correctly.
What I meant is that the field ids are not really used. Seems only the ColumnPath is used in column index, column resolution, etc. I am thinking of adding field id in ColumnDescriptor and keeping a map between id and ColumnDescriptor, or a map between id and ColumnPath.

Contributor

@sadikovi sadikovi left a comment

Looks good. I left a few comments and would appreciate it if you could take a look. Thanks!

@jackierwzhang
Contributor Author

@jackierwzhang No, those are set correctly. What I meant is that the field ids are not really used. Seems only the ColumnPath is used in column index, column resolution, etc. I am thinking of adding field id in ColumnDescriptor and keeping a map between id and ColumnDescriptor, or a map between id and ColumnPath.

Got it. I was asking because I tested locally and found that parquet-mr can actually save and read field ids via Spark, so I don't have to patch anything in the parquet-mr repo.

Though a couple of small problems remain for id matching on the parquet-mr side, I believe it's possible to extend this PR (or open another) to enable Spark to match by id in that code path; I'm going to do that soon.

Contributor

@sadikovi sadikovi left a comment

Looks good. I left a few minor comments, I would appreciate it if you could follow up. Thank you!

- SQLConf.SHUFFLE_PARTITIONS.key -> "5")
+ SQLConf.SHUFFLE_PARTITIONS.key -> "5",
+ // Enable parquet read field id for tests to ensure correctness
+ SQLConf.PARQUET_FIELD_ID_READ_ENABLED.key -> "true"
Contributor

Does this mean that we will not test match by name and will always test by field id?

Contributor Author

@jackierwzhang jackierwzhang Feb 9, 2022

Not really. Again, enabling this flag would only try to match field ids if they exist, but disabling this flag will completely ignore matching using field ids. So if I read with a Spark schema that has no ids at all and turn on this flag, it would be exactly the same as name matching.

I wanted to enable this flag for all tests to detect any regressions in existing test cases, in case this flag is turned on by default in the future.

Contributor

But they would exist once we start writing field ids for all of the fields, would not they?

Contributor Author

@jackierwzhang jackierwzhang Feb 10, 2022

Yeah, but it requires the original schema to contain parquet.field.id metadata, which is not present in any of the existing suites, so it should behave exactly like name matching.

Turning this on actually ensures that we didn't introduce any regression for existing code under this mixed matching mode, and detects if this metadata field has been used anywhere.

Member

@sunchao sunchao left a comment

This looks pretty good to me. Just some cosmetic comments.

def hasFieldId(field: StructField): Boolean =
field.metadata.contains(FIELD_ID_METADATA_KEY)

def getFieldId(field: StructField): Int = {
Member

Maybe we can consider combining getFieldId and hasFieldId into a single method:

def getFieldId(field: StructField): Option[Int]
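A combined accessor along these lines might look like the sketch below. It is an assumption-laden illustration, not the helper in this PR: `SimpleField` is a hypothetical stand-in for `StructField` that keeps metadata in a plain `Map[String, Long]`, and only the key name `parquet.field.id` comes from the PR description.

```scala
// Illustration of folding hasFieldId and getFieldId into one Option-returning method.
object FieldIdAccessorSketch {
  // Metadata key named in the PR description.
  val FieldIdMetadataKey = "parquet.field.id"

  // Hypothetical stand-in for StructField: only a name and numeric metadata.
  final case class SimpleField(name: String, metadata: Map[String, Long] = Map.empty)

  /** Some(id) when the metadata key is present, None otherwise. */
  def getFieldId(field: SimpleField): Option[Int] =
    // toIntExact throws ArithmeticException if the stored value does not fit in an Int.
    field.metadata.get(FieldIdMetadataKey).map(id => Math.toIntExact(id))
}
```

With this shape, a caller that only needs the presence check can write `getFieldId(f).isDefined`, and a caller that needs the value can pattern match on the result instead of asserting first.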

Contributor

IMHO, this is fine. I see that hasFieldId() is used separately, and the assertion would still have to be implemented somewhere anyway. As long as there is a test for this, it should be good.

Contributor

@sadikovi sadikovi left a comment

Looks good. I think you can remove the WIP label from your PR as it is no longer a work in progress.

Approved pending addressed comments.

@jackierwzhang jackierwzhang changed the title [WIP][SPARK-38094] Enable matching schema column names by field ids [SPARK-38094] Enable matching schema column names by field ids Feb 10, 2022
@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in b5eae59 Feb 18, 2022
dongjoon-hyun pushed a commit that referenced this pull request Mar 2, 2022
### What changes were proposed in this pull request?
Minor follow ups on #35385:
1. Add a nested schema test
2. Fixed an error message.

### Why are the changes needed?
Better observability.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing test

Closes #35700 from jackierwzhang/SPARK-38094-minor.

Authored-by: jackierwzhang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>